Abstract

The sudden appearance and potential lethality of severe acute respiratory syndrome associated coronavirus (SARS-CoV) in humans has 
focused attention on understanding its origins. Here, we assess phylogenetic relationships for the SARS-CoV lineage as well as the history 
of host-species shifts for SARS-CoV and other coronaviruses. We used a Bayesian phylogenetic inference approach with sliding window 
analyses of three SARS-CoV proteins: RNA dependent RNA polymerase (RDRP), nucleocapsid (N) and spike (S). Conservation of RDRP 
allowed us to use a set of Arteriviridae taxa to root the Coronaviridae phylogeny. We found strong evidence for a recombination breakpoint 
within SARS-CoV RDRP, based on different, well supported trees for a 5' fragment (supporting SARS-CoV as sister to a clade including
all other coronaviruses) and a 3' fragment (supporting SARS-CoV as sister to group three avian coronaviruses). These different topologies
are statistically significant: the optimal 5' tree could be rejected for the 3' region, and the optimal 3' tree could be rejected for the 5' region.
We did not find statistical evidence for recombination in analyses of N and S, as there is little signal to differentiate among alternative trees. 
Comparison of phylogenetic trees for 11 known host-species and 36 coronaviruses, representing coronavirus groups 1-3 and SARS-CoV, 
based on N showed statistical incongruence indicating multiple host-species shifts for coronaviruses. Inference of host-species associations 
is highly sensitive to sampling and must be considered cautiously. However, current sampling suggests host-species shifts between mouse 
and rat, chicken and turkey, mammals and manx shearwater, and humans and other mammals. The sister relationship between avian 
coronaviruses and the 3' RDRP fragment of SARS-CoV suggests an additional host-species shift. Demonstration of recombination in the
SARS-CoV lineage indicates its potential for rapid unpredictable change, a potentially important challenge for public health management 
and for drug and vaccine development. 

© 2003 Elsevier B.V. All rights reserved. 

1. Introduction

The sudden appearance and potential lethality of severe
acute respiratory syndrome associated coronavirus
(SARS-CoV) in humans has focused attention on understanding
its origins. The host reservoir from which humans
were infected remains to be determined. However, molecular
phylogenetics can be used to assess SARS-CoV’s evolutionary
origin and history of change by analyzing genes
from SARS-CoV with homologous genes from other coronaviruses.
Though surveys and sampling of coronaviruses
from both wild and domestic host-species are limited,
comparative phylogenetic analysis of viruses and hosts is
important in elucidating the history of host associations as well.

Coronaviruses have been divided into three groups based
on serological and genetic criteria (Siddell, 1995). To date,
group 3 coronaviruses have been found only in birds, group
1 coronaviruses have been found in carnivores, cetartiodactyls
and primates, and group 2 coronaviruses have been
found in cetartiodactyls, perissodactyls, rodents, and birds.
Previous phylogenetic analyses, all of which were unrooted,
suggested that SARS-CoV represents a relatively early diverging
coronavirus lineage equally distantly related to the
three groups of coronaviruses noted above. On this basis,
SARS-CoV was proposed as representing a fourth, distinct
group within the genus Coronavirus (Marra et al., 2003;
Rota et al., 2003). These previous studies, which focused
on characterizing and sequencing SARS-CoV, did not yield
evidence for recombination within the SARS-CoV genome,
although Marra et al. (2003) commented that the s2m motif
within the SARS-CoV UTR may be the product of horizontal
transfer, given the disjunct presence of s2m in many if not
all astroviruses, a picornavirus (ERBV) and only one group
3 coronavirus (avian infectious bronchitis virus (IBV)).

Here, we use phylogenetic analyses of SARS-CoV and 
other coronaviruses, rooted with diverse viruses from the 
family Arteriviridae, to show that the RNA dependent RNA
polymerase (RDRP) of SARS-CoV is a recombinant. We 
also compare phylogenetic trees for known coronaviruses 
and their hosts to assess the history of host associations for 
SARS-CoV and other coronaviruses. 

2. Materials and methods 

2.1. Identification and alignment of proteins

To identify and align three SARS-CoV proteins with homologs
in the non-redundant GenBank CDS translations
(also includes PDB, SwissProt and PIR; 20 April 2003),
we used PFAM hidden Markov models (HMM) (Bateman
et al., 2002) with the software HMMER (Eddy, 1998). The
HMMs we used are: PF05183 for RNA dependent RNA
polymerase, PF00937 for nucleocapsid (N) and PF01601 for
spike (S). For identical RDRP sequences, we only retained a
single representative. For N and S we used BLASTCLUST
to retain a single representative from 95% identity groups to
reduce the abundance of these sequences for computational
efficiency. Envelope and membrane proteins were not analyzed
because of their short size and lack of conservation
across coronaviruses.
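
The idea of keeping one representative per 95% identity group can be illustrated with a minimal sketch. This is not the BLASTCLUST algorithm, which clusters from BLAST hits; it is a simplified greedy stand-in that assumes the sequences are already aligned to equal length. The function names `percent_identity` and `pick_representatives` and the toy sequences are hypothetical.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical columns between two aligned sequences.

    Assumption: both sequences come from the same alignment (equal length);
    columns where both sequences have a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def pick_representatives(seqs: Dict[str, str], threshold: float = 0.95) -> List[str]:
    """Greedy selection: keep a sequence only if it is below the identity
    threshold against every representative kept so far."""
    reps: List[str] = []
    for name, seq in seqs.items():
        if all(percent_identity(seq, seqs[r]) < threshold for r in reps):
            reps.append(name)
    return reps


# Toy input (hypothetical sequences, 40 columns each).
base = "MKTAYIAKQR" * 4
toy = {
    "isolate_a": base,
    "isolate_b": base[:-1] + "W",  # 39/40 identical to isolate_a
    "isolate_c": "G" * 40,         # unrelated
}
representatives = pick_representatives(toy)
```

In this toy input, `isolate_b` falls in the same identity group as `isolate_a` and is dropped, while `isolate_c` is retained as a second representative. The result of a greedy pass depends on input order, which is one reason a dedicated clustering tool is preferable for real data.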

2.2. Detection of recombination within genes and 
phylogenetic analyses 

To detect recombination within RDRP, S and N genes,
we used sliding window phylogenetic analyses where windows
of 100 amino acids in 25 amino acid intervals were
analyzed using Bayesian inference (BI) (Mau et al., 1999;
Yang and Rannala, 1997). This approach is analogous to
bootscanning (e.g. Salminen et al., 1995); however, we use
Bayesian inference rather than neighbor joining (NJ) and
amino acid sequences rather than nucleotides. For BI, four
chains were run for 200 K generations with a 100 K generation
burn-in using a γ distribution of rates and the transition
matrices WAG for S and N, and rtREV for RDRP (Dimmic
et al., 2002) in MrBayes v.3b4 (Huelsenbeck and Ronquist,
2001) and summarized as a 50% majority rule consensus
tree. We used the differential phylogenetic position of different
SARS-CoV gene fragments (‘windows’) with respect to
other coronavirus groups to identify potential recombination
breakpoints and to divide the alignment into segments for
additional phylogenetic analyses (Fig. 1A for RDRP) with
BI and NJ bootstrap. For the analysis of these segments, BI
parameters were as above, except each chain was run for 1
million generations. For NJ bootstrap, we used 1000 bootstrap
replicates and NJ searches under default parameters in
PAUP*, summarized as a 50% majority rule consensus tree.


We used the approximately unbiased (AU) test (Shimodaira,
2002) to assess the validity of these breakpoints by determining
whether alternative phylogenetic placements for different
SARS-CoV gene regions can be statistically rejected
using the program CONSEL (Shimodaira and Hasegawa,
2001) with branch lengths and model parameters estimated
in PAML (Yang, 1997).
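
The window layout described above (100 amino acids wide, advanced in steps of 25) can be sketched as follows. This is a minimal illustration of how the alignment columns might be partitioned before each window is passed to a phylogenetic program; it does not run any tree inference. The function names `sliding_windows` and `slice_alignment`, the choice to drop a trailing partial window, and the toy alignment are assumptions, not details taken from the study.

```python
from typing import Dict, List, Tuple


def sliding_windows(length: int, size: int = 100, step: int = 25) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges for each window.

    Assumption: windows that would run past the end of the alignment
    are dropped rather than truncated.
    """
    return [(s, s + size) for s in range(0, length - size + 1, step)]


def slice_alignment(alignment: Dict[str, str], start: int, end: int) -> Dict[str, str]:
    """Extract the same column range from every aligned sequence."""
    return {name: seq[start:end] for name, seq in alignment.items()}


# Toy alignment of 200 columns (hypothetical placeholder sequences).
toy_alignment = {"SARS-CoV": "A" * 200, "IBV": "C" * 200}
windows = sliding_windows(200)
first_window = slice_alignment(toy_alignment, *windows[0])
```

With these settings, three consecutive windows cover 100 + 2 × 25 = 150 columns, which is how a run of three contiguous, overlapping windows comes to span 150 amino acids.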

2.3. Host association 

In order to evaluate the types of evolutionary events (codivergence,
duplication, sorting, host switching) that explain
the fit between coronavirus evolution and host evolution,
we considered the nucleocapsid coronavirus phylogeny and
its host phylogeny, where mammalian relationships in the
host tree follow Murphy et al. (2001). We used TreeFitter
v.1 (Ronquist, 2000), which assigns differential costs to
the four types of potential events of a host-parasite association:
codivergence (C), duplication (D), sorting (S) and
host switching (H). We used various event costs to test a
variety of situations (see Desdevises et al., 2002; Ronquist
and Liljeblad, 2001). Significance of fit was determined by
comparing the cost of the observed tree with the costs obtained from 10,000 random
permutations of the coronavirus tree terminals.
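
The permutation logic can be sketched generically. The example below is not TreeFitter and does not compute event-based reconciliation costs; it only shows how an observed cost can be compared with the costs of randomly permuted terminals. The cost function `adjacent_mismatches` is a deliberately crude, hypothetical stand-in (it counts neighbouring virus-tree tips whose hosts differ), and the names, seed and toy data are assumptions.

```python
import random
from typing import Callable, List, Sequence


def adjacent_mismatches(hosts: Sequence[str]) -> int:
    """Toy cost (hypothetical): hosts are listed in virus-tree tip order,
    and each pair of neighbouring tips with different hosts costs 1."""
    return sum(1 for a, b in zip(hosts, hosts[1:]) if a != b)


def permutation_p_value(
    hosts: Sequence[str],
    cost: Callable[[Sequence[str]], float],
    n_perm: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided permutation test: how often does a random assignment of
    hosts to terminals give a cost at least as low as the observed one?"""
    rng = random.Random(seed)
    observed = cost(hosts)
    shuffled: List[str] = list(hosts)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if cost(shuffled) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: tips grouped perfectly by host, so the observed cost is low.
toy_hosts = ["mouse"] * 3 + ["pig"] * 3 + ["chicken"] * 3
p_value = permutation_p_value(toy_hosts, adjacent_mismatches, n_perm=2000, seed=1)
```

A small p-value here means that few random permutations fit as well as the observed association. In a real analysis the cost function would be replaced by the reconciliation cost under the chosen event costs (C, D, S, H).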

3. Results 

3.1. Recombination within RDRP 

The RDRP HMM detected 27 unique sequences in GenBank
related to SARS-CoV from Arteriviridae and Coronaviridae.
The relationship between SARS-CoV and these
other coronaviruses for each 100 amino acid window in
RDRP, as indicated by BI phylogeny, is shown in Fig. 1A.
Three contiguous, overlapping windows spanning 150 amino
acids in the 5' region of the SARS-CoV RDRP are sister
to a clade including groups 1-3 (all other known coronaviruses).
In contrast, seven contiguous windows, spanning
259 amino acids in the 3' region, are sister to group 3
coronaviruses. Using this diagram (Fig. 1A), we split RDRP
into two fragments, 5' and 3'.

To assess the significance of this inference, we performed
extensive phylogenetic analyses on each fragment. The
optimal tree for the 5' region (Fig. 1B) and the 3' region
(Fig. 1C) mirrored the results of the sliding window analysis
(Fig. 1A). To assess the potential impact of the outgroups on
these results we also analyzed both regions for the 12 coronavirus
taxa alone. These unrooted topologies (not shown)
are compatible with the rooted topologies, indicating that the
results for SARS-CoV in Fig. 1 are not due to long branch
attraction involving the outgroup. We then used the approximately
unbiased (AU) tree selection test (Shimodaira, 2002)
to see if the alternative, competing trees for each gene fragment
can be statistically rejected in favor of the optimal tree,
or if the conflicting results between the 5' and 3' regions
are not statistically significant. As shown in Fig. 2, the optimal
sister relationship for the 5' region, with SARS-CoV
as sister to a clade including groups 1-3 combined, is rejectable
for the 3' region, and the optimal sister relationship
for the 3' region, with SARS-CoV sister to group 3, is rejectable
for the 5' region. Additionally, SARS-CoV RDRP
as sister to group 1 RDRP is rejectable in both regions, and
SARS-CoV as sister to group 2 is rejectable in the 3' region.
To assess the impact of the outgroups on these results, we
repeated the AU tests with only the 12 coronavirus taxa.
All alternative topologies for both regions were rejectable
in favor of the optimal topology according to the AU test.

Fig. 1. Recombinant nature of SARS-CoV RNA dependent RNA polymerase, as indicated by different sister relationships with other coronaviruses for
different gene regions. (A) Schematic diagram showing Bayesian inference (BI) sliding window (100 amino acids long in 25 amino acid intervals) analyses
used to assess recombination breakpoints within RDRP. The sister relationship of SARS-CoV RDRP with other coronavirus groups (1, 2 and/or 3) for
each fragment is indicated by color code and numbers. BI phylogenies are shown for the entire 5' (B) and 3' (C) regions of RDRP. Numbers by each node
are posterior probabilities, and when applicable, are followed by neighbor joining bootstrap percentages in italics. Multiple terminal nodes from a single
virus species are represented by a black triangle, with the number of terminals indicated in white numerals. GenInfo identifiers for the proteins used in
this analysis: 482297, 564004, 7769353, 93916, 233625, 14917044, 6625761, 13752450, 10242469, 94017, 12744851, 10179430, 25121660, 20271248,
11878197, 7650194, 17529672, 11878201, 25361011, 26008080, 12082740, 133455, 29293454, 14250963, 12240326, 10181074, 9635157, 29837504.

Fig. 2. Results of the AU topological test (Shimodaira, 2002) for alternative
trees based on the 5' and 3' RDRP putative recombinant fragments.
Compatible topologies for the 5' and 3' fragments are located in each
row. Putative recombinant fragments were inferred from the results of
the sliding window analysis shown in Fig. 1A. The topologies shown are
summaries, with each group represented by a single terminal taxon; see
Fig. 1 for details.

For S, the HMM detected 120 unique relatives of
SARS-CoV, which we reduced to 24 by 95% identity clustering.
For N, the HMM detected 93 unique relatives of
SARS-CoV, which we reduced to 32 by identity clustering.
For S and N, we were not able to reject alternative topologies
for segments when following the above procedure
(results not shown) and thus considered each gene as a historical
unit for further analysis. According to their HMMs,
both S and N are too variable to allow inclusion of an
outgroup in alignment and phylogenetic analyses; therefore,
their phylogenies are unrooted. For N (Fig. 3) and S (not
shown) coronavirus groups 1-3 are each monophyletic with
respect to SARS-CoV; however, we cannot say with statistical
confidence (according to the AU test) which group (1,
2 or 3) is most closely related to SARS-CoV.

3.2. Host-shifts 

Examination of the fit of the N virus tree to the host tree
was performed in TreeFitter v.1 (Ronquist, 2000). Under
default settings (H = 2, S = 1, D = 0, C = 0), nine host
switches (P ≪ 0.001) describe a significant fit (P < 0.001)
of the virus and host trees, while codivergences, duplications
and sorting events are rare (P > 0.05). As further evidence
of this, when the program settings are changed to maximize
codivergence events (H = 0; or H = 0 and C = −1) the
global fit between the two trees is no longer significant (P
> 0.05). Together, these results indicate that, given current
sampling, host switches have been extremely important in
the evolution of coronaviruses and their hosts.


4. Discussion 

4.1. Phylogeny and recombination 

The difference in the phylogenies inferred from the 5'
(Fig. 1B) and 3' (Fig. 1C) RDRP regions, and the significant
differences in support and rejectability of alternative trees
for each gene region, all strongly support the hypothesis of
an ancient recombination event between two co-infecting
viruses. We say ‘ancient’ to denote that both regions are
sister to clades of other sequences, rather than to any single
recently diverged sequence. These results indicate that
the two SARS-CoV RDRP regions do indeed represent two
unique histories. Thus, it is preferable not to analyze them
together, as has been done in previous analyses (Marra et al.,
2003; Rota et al., 2003) because their history cannot be represented
by a single tree. These previous analyses used distance
methods (NJ) less able to accommodate heterogeneity
in rates of sequence character change, and found RDRP as a
whole to be closest to group 2 coronaviruses. This may result
from conflicts within the data stemming from recombination
and/or from effects of rate heterogeneity. The authors do not
report whether alternative trees could be rejected based on
their analyses of multiple genes, though it seems unlikely,
as when we performed similar analyses for the S and N proteins
we were not able to differentiate between alternative
SARS-CoV sister relationships using the AU test. We note
that the approach for detecting recombination implemented
here is rigorous in comparison to traditional bootscanning, in
that it analyzes more conserved amino acids using Bayesian
inference rather than NJ, and explicitly tests the significance
of alternative topologies using the AU test. It is
possible and likely that more recombination events have
happened within RDRP, N, S or other SARS-CoV genes,
than we have detected here, although the evidence for recombination
generally becomes more difficult to discern
over time.

Fig. 3. Host phylogeny and nucleocapsid phylogeny. [...] et al., 2001 for mammals). Lines drawn between the two phylogenies indicate the host status of each coronavirus. For the nucleocapsid phylogeny, all
nodes are supported by >50% Bayesian posterior probability. Nodes overlaid with circles are also supported by >75% of neighbor joining bootstraps.
GenInfo identifiers for the proteins used in this analysis: 1220375, 395178, 127872, 11096193, 543643, 3132999, 1515361, 21624372, 222585, 29840828,
13448682, 28916465, 281107, 74863, 28460530, 11640712, 29836503, 14253137, 1515365, 1515367, 6689852, 6689856, 320020, 1515375, 1515373,
1515371, 331869, 547999, 21624295, 21624366, 21624369, 28932648, 28932650.


Inclusion of an outgroup, as we have done with RDRP
from 15 Arteriviridae taxa (Fig. 1), allows inference of the
sister relationships and the relative age and timing of Coronaviridae
(Coronavirus and Torovirus) divergence events,
missing from the previous unrooted analyses which only
could assess distance between clades. Our rooted analyses
indicate that the 5' RDRP fragment diverged from other
Coronavirus taxa prior to divergences between and within
groups 1-3. Fig. 1C indicates that the 3' RDRP fragment
diverged from other coronavirus homologs more recently,
after divergences between and within groups 1-3. Interestingly,
Fig. 1C also shows non-monophyly for group 1 coronaviruses.
This is not surprising, given that groups 1-3 were
initially distinguished based on serological tests rather than
phylogenetic analyses and given the capacity for recombination.


The sister relationship between the more recently diverged
SARS-CoV 3' RDRP fragment and group 3 avian infectious
bronchitis viruses (Fig. 1C) suggests that potential horizontal
transmissions of s2m to SARS-CoV (Marra et al., 2003)
and the 3' region of RDRP are correlated. They may have
even been incorporated concomitantly, perhaps on transmission
from an ancestor of IBV, the only other coronavirus with s2m
(Jonassen et al., 1998). As the 5' region of RDRP and the
s2m motif are disjunct in the SARS-CoV genome, putative
replication-dependent recombination would have involved
several consecutive template switches, as was inferred for
the transfer of s2m to IBV (Jonassen et al., 1998). Alternatively,
horizontal transfer from astroviruses or a picornavirus
cannot be ruled out, as horizontal transfer of s2m accounts
for its presence in three different virus families (Jonassen
et al., 1998).

Phylogenetic indication of recombination for SARS-CoV
makes sense, as coronaviruses are unique among single-stranded,
non-segmented RNA viruses in their propensity
for recombination, a mechanism purportedly useful in eliminating
frequent deleterious mutations in large RNA viruses
(Lai, 1996). Coronavirus genomes generally appear resilient
and able to tolerate deletions, insertions and rearrangements
(de Haan et al., 2002). Similar to our findings, Decimo et al.
(1993) suggested that N genes from murine hepatitis viruses
were the result of double recombination, and several authors
have reported evidence for recombination among natural isolates
of IBV (e.g. Jia et al., 1995; Wang et al., 1993). Recombination
can be important in gain of novel functions. For
example, in HIV recombination is considered to be a powerful
adaptive mechanism for antiviral agent resistance and
cytotoxic T-cell escape (e.g. Morris et al., 1999). The combination
of horizontal transfer and recombination results in
complex phylogenies that may blur the evolutionary history
of genes (e.g. Keeling and Palmer, 2001; Rest and Mindell,
2003), especially since horizontal transfer and recombination
are often associated processes.

4.2. Host association 

Despite the limitations imposed on inference of historical
host association by the restricted sampling of hosts and
coronaviruses to date, some preliminary observations can
be made. Coronaviruses have been shown to be particularly
host specific (Lai, 1990; Sturman and Holmes, 1983) and it
had been assumed by some that they coevolved (diversified
in tandem) with their hosts (Decimo et al., 1993). However,
based on current sampling and analyses summarized
in Fig. 3, some host switching events are implicated in accounting
for incongruence between the host and coronavirus
phylogenies. For example, chicken and turkey are sister
taxa within the host phylogeny, yet isolates from each do
not form host-specific monophyletic groups. Rather, some
isolates from chicken are most closely related to turkey
isolates, suggesting host-shifts for coronaviruses between
these two bird species (Fig. 3). A third avian species, the
Manx shearwater (Puffinus puffinus), is the host to a group
2 coronavirus (Kirkwood et al., 1995); group 2 coronaviruses are otherwise
known only from mammals. The phylogenetically nested
position of this avian coronavirus within mammalian-host
isolates suggests a possible bird-mammal host-shift. Coronavirus
host-shifts between mouse and rat are implicated in
the same manner as for chicken and turkey; isolates from
the two rodent species are not reciprocally monophyletic.
Similarly, isolates from pig are not monophyletic, including
two group 1 and one group 2 coronavirus in Fig. 3. With
the outbreak of SARS, coronavirus isolates from humans
are also non-monophyletic.

Using an earlier and more limited sampling of host species
(n = 6) and coronaviruses (n = 9), Decimo et al. (1993)
suggested a host-shift between cats and pigs. Inclusion of additional
host and virus sampling in Fig. 3, including two isolates
from dogs, implicates potential host-shift between dogs
and pigs, rather than cats and pigs. This demonstrates sensitivity
to sampling, although the implication of host-shifting
as a phenomenon remains. The emerging picture of coronavirus
host associations is increasingly indicative of
host-shifts. This is supported in Fig. 3 by the observations
mentioned above as well as (1) the sister relationship between
human coronavirus 229E and porcine (pig) epidemic
diarrhea virus and (2) non-monophyly for human coronavirus
229E and SARS-CoV. Further, according to statistical
tests in TreeFitter, allowing host-shifts to occur results in significant
fit while duplication, codivergence and sorting play
no detectable role. Depending on the assigned costs, TreeFitter
estimates between 9 and 15 host-shifts in reconciling the
host and coronavirus phylogenies shown in Fig. 3, though
the actual number is unknown. In light of the genomic disparity
among diverse coronaviruses from some individual
host-species (e.g. humans, pigs), it seems unlikely that increased
sampling of coronaviruses will yield monophyly for
all isolates from each of the individual host-species in Fig. 3.

The finding of recombination for SARS-CoV RDRP,
the relatively early phylogenetic divergence for RDRP
fragments (prior to the most recent divergences within
coronavirus groups 1-3), as well as the inference of multiple
coronavirus host switches, suggest that SARS-CoV
belongs to an old, potentially diverse, and changeable coronavirus
lineage that remains to be discovered in its natural
hosts. Demonstration of recombination in the SARS associated
coronavirus lineage indicates its potential for rapid
unpredictable change, a potentially important challenge for
public health management and for drug and vaccine development.
The known non-human coronaviruses come from
only nine, mostly domestic, mammal or bird species, and
searches for the zoonotic reservoir might reasonably focus
on other species, including non-domesticated animals, that
are used as food for humans in the geographic region of the
SARS outbreak.